A Semismooth Newton Stochastic Proximal Point Algorithm with Variance Reduction
We develop an implementable stochastic proximal point (SPP) method for a
class of weakly convex, composite optimization problems. The proposed
stochastic proximal point algorithm incorporates a variance reduction mechanism
and the resulting SPP updates are solved using an inexact semismooth Newton
framework. We establish detailed convergence results that take the inexactness
of the SPP steps into account and that are in accordance with existing
convergence guarantees of (proximal) stochastic variance-reduced gradient
methods. Numerical experiments show that the proposed algorithm competes
favorably with other state-of-the-art methods and achieves higher robustness
with respect to the step size selection.
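To make the update concrete, the following is a minimal sketch of one variance-reduced SPP step, assuming an SVRG-style correction around a reference point. The paper's inexact semismooth Newton inner solver is stood in for by a generic off-the-shelf minimizer purely for illustration, and all names (`f_i`, `grad_i`, `alpha`, and so on) are our own rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize


def spp_vr_step(x, i, f_i, grad_i, x_ref, full_grad_ref, alpha):
    """One inexact, variance-reduced SPP step on component i (hypothetical sketch).

    Approximately solves
        argmin_y  f_i(y) + <full_grad_ref - grad_i(x_ref), y>
                  + (1 / (2 * alpha)) * ||y - x||^2,
    whose gradient at y recovers an SVRG-style estimator plus the
    proximal term.
    """
    # SVRG-style correction built from the reference point
    corr = grad_i(x_ref, i) - full_grad_ref

    def model(y):
        return (f_i(y, i) - corr @ y
                + 0.5 / alpha * np.dot(y - x, y - x))

    # Inexact subproblem solve; a stand-in for the paper's
    # semismooth Newton framework
    res = minimize(model, x, method="L-BFGS-B")
    return res.x
```

Here `alpha` plays the role of the proximal step size; per the abstract, the convergence analysis explicitly accounts for solving this subproblem only inexactly.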
MoMo: Momentum Models for Adaptive Learning Rates
Training a modern machine learning architecture on a new task requires
extensive learning-rate tuning, which comes at a high computational cost. Here
we develop new adaptive learning rates that can be used with any momentum
method, and require less tuning to perform well. We first develop MoMo, a
Momentum-Model-based adaptive learning rate for SGD-M (stochastic gradient
descent with momentum). MoMo uses momentum estimates of the batch losses and
gradients sampled at each iteration to build a model of the loss function. Our
model also makes use of any known lower bound of the loss function via
truncation; e.g., most losses are bounded below by zero. We then approximately
minimize this model at each iteration to compute the next step. We show how
MoMo can be used in combination with any momentum-based method, and showcase
this by developing MoMo-Adam, which is Adam with our new model-based adaptive
learning rate. Additionally, for losses with unknown lower bounds, we develop
on-the-fly estimates of a lower bound that are incorporated into our model.
Through extensive numerical experiments, we demonstrate that MoMo and MoMo-Adam
improve over SGD-M and Adam in terms of accuracy and robustness to
hyperparameter tuning when training image classifiers on MNIST, CIFAR10,
CIFAR100, and ImageNet, recommender systems on the Criteo dataset, and a
transformer model on the IWSLT14 translation task.
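Based only on the description above, here is a hedged sketch of what one MoMo-style update for SGD-M could look like: momentum averages of losses, gradients, and gradient-iterate inner products define a model of the loss, which is truncated at a known lower bound (zero here) and approximately minimized to set the step size. The variable names and exact model form are our reconstruction, not the authors' reference code.

```python
import numpy as np


def momo_step(x, loss, grad, d, f_bar, gamma_bar,
              beta=0.9, alpha_max=1.0, f_star=0.0):
    """One MoMo-style update for SGD with momentum (hypothetical sketch).

    x: current iterate; loss, grad: sampled batch loss and gradient at x;
    d, f_bar, gamma_bar: momentum averages of gradients, losses, and
    <grad_j, x_j> inner products carried across iterations.
    """
    # Exponential moving averages used to build the loss model
    d = beta * d + (1 - beta) * grad
    f_bar = beta * f_bar + (1 - beta) * loss
    gamma_bar = beta * gamma_bar + (1 - beta) * grad @ x

    # Model value at x, truncated at the known lower bound f_star
    model_gap = max(f_bar + d @ x - gamma_bar - f_star, 0.0)

    # Adaptive learning rate: cap the model-minimizing step by alpha_max
    lr = min(alpha_max, model_gap / (d @ d + 1e-12))
    x = x - lr * d
    return x, d, f_bar, gamma_bar
```

Consistent with the abstract's claim that the model combines with any momentum-based method, a MoMo-Adam variant would swap the plain momentum direction for Adam's preconditioned direction while keeping the same model-based cap on the learning rate.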